

# THE GENERATIVE AI AND THE FUTURE OF CLOUD AND DATA-CENTER

Thitipat J. thitipat@juniper.net





# AI/ML?



#### What is Machine Learning?



Artificial Intelligence is a discipline





Machine Learning is a subfield of Artificial Intelligence



#### Supervised learning implies the data is already labeled
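As a minimal sketch of what "already labeled" means in practice, the snippet below trains a toy nearest-centroid classifier in plain Python. The dataset, the feature meanings, the label names, and the function names are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from math import dist


def fit_centroids(samples, labels):
    """Average the feature vectors of each label (the 'training' step)."""
    grouped = defaultdict(list)
    for features, label in zip(samples, labels):
        grouped[label].append(features)
    return {
        label: tuple(sum(column) / len(rows) for column in zip(*rows))
        for label, rows in grouped.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest to the new sample."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


# Hypothetical labeled data: (packets per second, average packet size in bytes)
samples = [(100, 1400), (120, 1350), (9000, 80), (8500, 90)]
labels = ["bulk", "bulk", "chatty", "chatty"]

centroids = fit_centroids(samples, labels)
prediction = predict(centroids, (8800, 100))
print(prediction)
```

The key point is that every training sample arrives with its answer attached; the model only has to generalize from those answers to unseen samples.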





#### What is Machine Learning?





Deep learning uses artificial neural networks, which allows it to process more complex patterns than traditional machine learning










# MODERN DATA-CENTER WORKLOAD



#### **Existing Workloads**





#### **Modern Workloads**

**GPU/TPU Acceleration** 

Parallel computing across servers

Disk read/write speed improvement

NVMe (RoCEv2/NVMe-TCP)

Analytical / Training / Storage







### Existing vs Modern Workloads

#### **Existing Workloads**

- Heterogeneous applications
- High number of tenants
- Workloads are loosely coupled
- GPU/TPU requirement is relatively low
- Relatively low throughput

#### Modern Workloads (AI/ML)

- Large computing problems
- Low number of tenants
- Workloads are tightly coupled
- GPU/TPU requirement is high
- Very high throughput



#### **New Product Mapping**





## **Modern Workloads Requirements**

| Requirement                            | How it is addressed                                                                                                                                                    |
|----------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| High Throughput and Density            | <ul> <li>Increase port speed</li> <li>Increase port density</li> </ul>                                                                                                 |
| Reduced Job Completion Time (JCT)      | <ul> <li>1:1 subscription fabric design</li> <li>Reduce latency (e.g. cut-through mode, TH-F1)</li> </ul>                                                              |
| Efficient Load Balancing               | <ul> <li>Dynamic/Global Load Balancing</li> <li>IP ECMP (64/128)</li> </ul>                                                                                            |
| Reliable Transit                       | <ul> <li>RoCEv2 (PFC-IP/ECN) + (Source Flow Control, Congestion Isolation)</li> <li>Sub-second convergence time</li> <li>Deep buffer [TBD for AI/ML fabric]</li> </ul> |
| Zero Trust Security                    | <ul> <li>MACsec and overlay encryption (VXLAN-Sec)</li> <li>DDoS protection</li> </ul>                                                                                 |
| Juniper Apstra Intent-based Operations | <ul> <li>Automated deployments</li> <li>Easy scale-out and scale-in</li> <li>Telemetry, xFlow, and closed-loop automation</li> </ul>                                   |
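The "1:1 subscription fabric design" requirement can be checked with simple arithmetic: the server-facing bandwidth of a leaf should not exceed its fabric-facing bandwidth. The sketch below is illustrative only; both port layouts are hypothetical and do not describe a specific Juniper platform.

```python
from fractions import Fraction


def oversubscription(down_ports, down_gbps, up_ports, up_gbps):
    """Ratio of server-facing to fabric-facing bandwidth on one leaf."""
    return Fraction(down_ports * down_gbps, up_ports * up_gbps)


# Hypothetical AI/ML leaf: 32 x 400G to GPU servers, 32 x 400G to spines
ai_leaf = oversubscription(32, 400, 32, 400)

# Hypothetical general-purpose leaf: 48 x 100G down, 8 x 400G up
classic_leaf = oversubscription(48, 100, 8, 400)

print(f"AI/ML leaf:   {ai_leaf.numerator}:{ai_leaf.denominator}")
print(f"Classic leaf: {classic_leaf.numerator}:{classic_leaf.denominator}")
```

A ratio above 1:1 means that tightly coupled training traffic can queue at the leaf when all servers transmit at once, which lengthens job completion time.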

#### Hyperscaler's DC



#### AI/ML block connect



[Figure: AI/ML block interconnect diagram. Recoverable labels: 800GE links; 64x400G or 128x200G (Option-2); Block 32; Training Clusters; 64 x 400GE; 32 x 400GE.]



# THE EVOLUTION



#### **The Ethernet (R)evolution**

Past, present and future of Ethernet transceiver sales across the industry



Adapted from LightCounting, September 2022 High Speed Ethernet Optics Report


## What's driving the need for 800GE/1.6TE?

The AI/ML Goldrush



• The AI/ML "goldrush" heavily impacts major optics vendors, which are expected to benefit from the increased use of 800G and faster optics in large AI/ML clusters.




### What's driving the need for 800GE/1.6TE?

The AI/ML Goldrush

- AI/ML clusters require a lot of bandwidth:
  - The latest generation of GPUs (Nvidia Hopper) uses up to 3.6 Tbps of GPU-to-GPU interconnect bandwidth for shared memory access.
  - The front-end network of a high-end GPU server such as the Nvidia DGX H100 (8 GPUs) has 10 x 400G network interfaces (InfiniBand or Ethernet).
- AI/ML clusters traditionally use InfiniBand:
  - Better control over tail latency with (very) large flows going over the fabric.
- Hyperscalers prefer to adopt Ethernet instead:
  - Better scalability to larger clusters.
  - Better suited for multi-tenant clouds running many different applications.
  - Latency can be controlled by packet spraying and re-ordering [1].

[1] https://nvdam.widen.net/s/6lmkmc8lqg/nvidia-spectrum-xwhitepaper
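To illustrate why packet spraying helps with a small number of very large flows, the sketch below compares static per-flow ECMP hashing with per-packet round-robin spraying over equal-cost links. It is a toy model under stated assumptions: the SHA-256 hash, the addresses, the flow sizes, and the packet size are all hypothetical, and real switches use their own hardware hash functions.

```python
import hashlib
from itertools import cycle


def ecmp_link(flow, n_links):
    """Static ECMP: hash the flow's 5-tuple once and pin the flow to one link."""
    digest = hashlib.sha256(repr(flow).encode()).digest()
    return int.from_bytes(digest[:4], "big") % n_links


def per_flow_load(flows, n_links):
    """Bytes per link when every flow stays on its hashed link."""
    load = [0] * n_links
    for flow, size in flows:
        load[ecmp_link(flow, n_links)] += size
    return load


def sprayed_load(flows, n_links, packet_size=4096):
    """Bytes per link when packets are sprayed round-robin over all links."""
    load = [0] * n_links
    links = cycle(range(n_links))
    for _, size in flows:
        for _ in range(size // packet_size):
            load[next(links)] += packet_size
    return load


# Hypothetical workload: 4 large RoCEv2 flows (UDP destination port 4791)
# of equal size, sent over 4 equal-cost links.
flows = [
    (("10.0.0.%d" % i, "10.0.1.%d" % i, 17, 49152 + i, 4791), 4096 * 1000)
    for i in range(4)
]

hashed = per_flow_load(flows, 4)
sprayed = sprayed_load(flows, 4)
print("per-flow ECMP:  ", hashed)
print("packet spraying:", sprayed)
```

With per-flow hashing, two large flows can land on the same link while another link stays idle. Spraying evens out the load, at the cost of packets arriving out of order, which is why the receiver has to re-order them.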



#### Nvidia DGX H100

https://www.nvidia.com/en-us/data-center/dgxh100/

### What's driving the need for 800GE/1.6TE?

The AI/ML Goldrush

- AI/ML clusters for large language models take connectivity to even more extreme levels:
  - The Nvidia DGX GH200 has a back-end network that interconnects 256 GPUs with 7.2 Tbps of GPU-to-GPU bandwidth and 921.6 Tbps of bisection bandwidth.
  - Shared memory access creates a shared GPU memory space of 144 TB.
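The bisection figure quoted above is consistent with a simple model: split the 256 GPUs into two halves and let every GPU in one half drive its full bandwidth toward the other half. The sketch below reproduces that arithmetic. The even memory split and the 1 TB = 1024 GB convention are assumptions made for illustration.

```python
N_GPUS = 256              # from the slide
PER_GPU_GBPS = 7_200      # 7.2 Tbps GPU-to-GPU bandwidth, from the slide
SHARED_MEMORY_TB = 144    # from the slide

# Bisection bandwidth: one half of the GPUs (128) sending at full rate
# to the other half.
bisection_gbps = (N_GPUS // 2) * PER_GPU_GBPS

# Assumption: the shared memory is spread evenly and 1 TB = 1024 GB.
memory_per_gpu_gb = SHARED_MEMORY_TB * 1024 / N_GPUS

print(f"Bisection bandwidth: {bisection_gbps / 1000} Tbps")
print(f"Memory per GPU:      {memory_per_gpu_gb} GB")
```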



#### Nvidia DGX GH200



https://hc34.hotchips.org/assets/program/conference/day2/Network%20and%20Switches/NVSwitch%20HotChips%202022%20r5.pdf

https://developer.nvidia.com/blog/nvidia-grace-hopper-superchip-architecture-in-depth/



# The evolution from 400G to 800G

## 800G adoption on routers and switches

Evolution to 100G Electrical I/O

• The industry is evolving from 50G to 100G electrical I/O, and the number of SerDes per PFE is increasing:





Juniper Express 5 (28.8T, BXX): **288 x 100G**

Juniper Express 5 (14.4T, BXF): **144 x 100G**

For more details: Chang-Hong Wu, "Juniper's Express 5: A 28.8Tbps Network Routing ASIC and Variations", https://hc34.hotchips.org
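The capacity figures follow directly from the SerDes counts. The sketch below reproduces them and derives how many 800G ports the same lanes could feed. It assumes that an 800G port consumes 8 x 100G lanes and that every SerDes faces a front-panel port, so the port counts are an arithmetic upper bound rather than a product specification.

```python
LANE_GBPS = 100           # 100G electrical I/O per SerDes
LANES_PER_800G_PORT = 8   # assumption: 8 x 100G lanes per 800G port

# SerDes counts per Express 5 variant, from the slide
VARIANTS = {"BXX": 288, "BXF": 144}


def summarize(serdes_count):
    """Return (capacity in Tbps, maximum number of 800G ports)."""
    capacity_tbps = serdes_count * LANE_GBPS / 1000
    max_800g_ports = serdes_count // LANES_PER_800G_PORT
    return capacity_tbps, max_800g_ports


for name, serdes_count in VARIANTS.items():
    capacity, ports = summarize(serdes_count)
    print(f"{name}: {capacity} Tbps, up to {ports} x 800G ports")
```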



#### 800G adoption on routers and switches

Evolution to 100G Electrical I/O

[Figure: transition from 50G/lane electrical I/O (16/7 nm PFE, QSFP-DD) to 100G/lane electrical I/O (7/5 nm PFE) with 8 x 100G PAM4 optical I/O. Callouts:]

- **100G SERDES:** 53 Gbaud PAM4 (106.25 Gbit/s/lane) using KP4 FEC
- **8x100G PAM4 optical I/O:** backwards compatible with mainstream 100G/400G optics
- **Break-out:** highly optimized for 8x100GE break-out

• The adoption of 100G serial electrical I/O is the key building block for high-density 100GE/400GE-optimized routing and switching platforms
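The 106.25 Gbit/s per-lane rate quoted for 100G SerDes can be derived from the 100G payload plus coding overhead. The sketch below assumes the usual 256b/257b transcoding followed by RS(544,514) "KP4" FEC, and uses exact fractions to avoid rounding.

```python
from fractions import Fraction

PAYLOAD_GBPS = 100
TRANSCODE = Fraction(257, 256)   # 256b/257b transcoding overhead
KP4_FEC = Fraction(544, 514)     # RS(544,514) "KP4" FEC overhead
BITS_PER_PAM4_SYMBOL = 2         # PAM4: 4 amplitude levels = 2 bits per symbol

line_rate_gbps = PAYLOAD_GBPS * TRANSCODE * KP4_FEC
baud_rate_gbd = line_rate_gbps / BITS_PER_PAM4_SYMBOL

print(float(line_rate_gbps))  # 106.25
print(float(baud_rate_gbd))   # 53.125
```

The "53 Gbaud" on the slide is this 53.125 GBd symbol rate, rounded.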





• Today's mainstream 100G/400G optics, i.e. 100G DR/FR/LR and 400G DR4/FR4/LR4, are forward compatible with 800G break-out

### **Break-out options for 800G ports**

Increased fan-out to support high-radix architectures
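As a rough sketch of what "increased fan-out" means, the snippet below enumerates the even ways to split one 800G port into logical ports, assuming the port is built from 8 electrical lanes of 100G each. Whether a given platform or optic supports each split is not stated here, so the output is an arithmetic enumeration only.

```python
LANES = 8         # assumption: one 800G port = 8 electrical lanes
LANE_GBPS = 100   # 100G per lane


def breakout_options(lanes=LANES, lane_gbps=LANE_GBPS):
    """List (logical ports, Gbps per logical port) for each even lane split."""
    options = []
    for lanes_per_port in range(1, lanes + 1):
        if lanes % lanes_per_port == 0:
            options.append((lanes // lanes_per_port, lanes_per_port * lane_gbps))
    return options


for count, speed in breakout_options():
    print(f"{count} x {speed}GE")
```

The higher the break-out count per physical port, the more devices a single switch can reach directly, which is what a high-radix design relies on.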



# 800G & 1.6T standardization

## 800G Ethernet "Time to Market"

#### ETC specification for 800G

- Data centers are now starting to deploy 800G ports:
  - Using, e.g., switch silicon such as Broadcom Tomahawk 4
  - Initially mainly for use as 2 x 400GE
- 800G Ethernet MSA specification released in 2020 by the Ethernet Technology Consortium (ETC)\*:
  - ETC specification doubles the bandwidth of 400GE to support 800G clear channel.
  - Re-uses the PCS/FEC specification from 400GBASE-R.
  - Effectively 2 x 400G PCS in parallel, i.e. 2 x (16 x 25G) PCS lanes
- Routers and switches supporting 800GE ports will start to become available in 2023-24:
  - Including Juniper Express 5 (BX) and Broadcom Jericho 3

\* The ETC was previously known as the 25 Gigabit Ethernet Consortium.

https://ethernettechnologyconsortium.org/wp-content/uploads/2020/03/800G-Specification\_r1.0.pdf
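The "2 x (16 x 25G) PCS lanes" structure of the ETC specification can be checked with a few lines of arithmetic. The sketch below assumes the 800G port is carried over 8 physical lanes of 100G; the physical lane count is an assumption for illustration and is not stated on the slide.

```python
PCS_INSTANCES = 2             # two 400G PCS in parallel (from the slide)
PCS_LANES_PER_INSTANCE = 16   # from the slide
PCS_LANE_GBPS = 25            # from the slide
PHYSICAL_LANES = 8            # assumption: 8 x 100G lanes per 800G port

total_pcs_lanes = PCS_INSTANCES * PCS_LANES_PER_INSTANCE
total_gbps = total_pcs_lanes * PCS_LANE_GBPS
pcs_lanes_per_physical_lane = total_pcs_lanes // PHYSICAL_LANES

print(f"{total_pcs_lanes} PCS lanes x {PCS_LANE_GBPS}G = {total_gbps}G")
print(f"{pcs_lanes_per_physical_lane} PCS lanes multiplexed per physical lane")
```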



### 800GE & 1.6TE standardization



IEEE 802.3df & 802.3dj



• New Ethernet standards require fundamentally new component and system technologies, and time to build consensus on a standard with long-term relevance

